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Abstract 

We describe a suite of validation metrics that assess the credibility of a given automatic spike sort¬ 
ing algorithm applied to a given electrophysiological recording, when gronnd-truth is unavailable. By 
rerunning the spike sorter two or more times, the metrics measure stability under various perturbations 
consistent with variations in the data itself, making no assumptions about the noise model, nor about the 
internal workings of the sorting algorithm. Such stability is a prerequisite for reproducibility of results. 
We illustrate the metrics on standard sorting algorithms for both in vivo and ex vivo recordings. We 
believe that such metrics could reduce the significant human labor currently spent on validation, and 
should form an essential part of large-scale automated spike sorting and systematic benchmarking of 
algorithms. 


1 Introduction 

One of the most powerful and widely used methods for studying neuronal activity in vivo (for example, in 
behaving animals) or ex vivo (for example, in extracted retinal preparations) is direct electrical recording. 
Using either single electrodes, tetrodes, or high-density multielectrode arrays, the experimental data consist 
of voltage patterns measured by sensors at the electrode tips, which are typically in the extracellular space. 
The strongest signals arise from an unknown but modest number of neurons in the immediate vicinity of 
each sensor, superimposed on a background consisting substantially of signals due to more and more distant, 
electrically shielded neurons [3]. Spike sorting is the name given to algorithms that detect distinct firing 
events and associate those events with specific neurons. The input consists of voltage traces from one or 
more electrodes and the output is a list of individual neurons that have been identified, as well as the spiking 
(firing) times for each. The literature on this subject is vast, and a very partial list of references includes 

[iiaiTniiiiiiiziiiaiig]. 

In this environment, a typical data processing pipeline consists of (1) filtering the signal to remove 
high frequency noise and low frequency baseline drift, (2) spike detection, windowing and alignment, and 
(3) clustering based on the spike shape. The latter step, which is typically the core of the spike sorting 
algorithm, is predicated on the critical assumption that the electrical signals from distinct neurons have 
distinct “shapes,” discussed in more detail below. Finally, once the algorithm has assigned a neuron identity 
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to each spike, various metrics are employed to assess the quality of the classification. In many experiments, 
the signals are, in fact, nonstationary. That is, the voltage pattern due to the firing of each neuron may vary 
over time. For the sake of simplicity, we will restrict our attention here to stationary data, but note that the 
framework for quality assessment that we introduce below does not depend on stationarity 

Unfortunately, despite major progress in algorithms and software, spike sorting remains a labor-intensive 
task with a substantial manual component, both in the clustering and quality assessment phases. Fur¬ 
thermore, in the last two decades, the number of channels from which it is becoming possible to record 
simultaneously has grown from one (or a few) to thousands [SI [THl [571 [3T]. As a result, it has become an ur¬ 
gent matter to accelerate the data analysis, to automate the clustering algorithms, and to develop statistical 
protocols that can provide robust estimates of the accuracy of the neuronal identification. 

Despite some important work on validation, discussed in the next section, however, there are currently no 
established standards in the community either for estimating errors or for assessing the reproducibility of the 
output of spike sorting software. Moreover, as noted in [laiiiiMlEa, it is of particular importance to be 
able to assess the fidelity of the output for each of the identified neurons separately. This is vital information 
in subsequent modeling, since the error rate from the spike sorting phase may permit the inclusion of 
some neurons and the exclusion of others when making inferences about neural circuitry, depending on the 
sensitivity of the question being asked. 

In this paper, we concentrate on the problem of quality assessment, and propose a framework for estimat¬ 
ing confidence in the results obtained from a black box spike sorting algorithm, in the absence of ground truth 
data such as an intra-cellular reference electrode. Since rigorous estimates of accuracy are not obtainable in 
this environment, we develop statistical measures of stability instead, drawing on ideas from bootstrapping 
and cross-validation in statistics and machine learning [sniiiniiiiKisiissi- One of the important features of 
our framework is that quality estimation is carried out in an algorithm-agnostic fashion, not as an internal 
part of the spike sorting algorithm itself. This requires a standard interface for the data that is compatible 
with all (or most) algorithms, and the ability to invoke the spike sorting procedure without manual inter¬ 
vention. Our validation scheme, in fact, needs to be able to re-run the spike sorting software two or more 
times; see Fig. [TJ 

We hope that the stability metrics and neuron-by-neuron confidence measures described here will serve, 
in part, as a step toward systematizing the external evaluation of spike sorting algorithms and, in part, as 
motivation to fully automate the spike sorting process. Indeed, we see this as crucial to progress in the 
analysis of increasingly large scale electrophysiology data. Finally, we should note that we do not claim to 
present a novel algorithm for spike sorting—we will illustrate our metrics using conventional clustering and 
time series analysis tools. 

The structure of this paper is as follows. In the remainder of the introduction we overview some existing 
approaches to validation ISec. lI.lL then describe lSec. ll.2|l the two versions of spike sorting that we consider: 
the simpler sorting of individual spikes or “clips” (which have already been detected and aligned), and the 
more realistic sorting of a full time series. Standardizing the interfaces to these two algorithms is crucial for 
wide applicability of our metrics. Sec. [5] presents four schemes of increasing complexity for validating spike 
sorting on clips, and illustrates them on a standard PCA/k-means sorting algorithm applied to in vivo rodent 
data. Then Sec. [3| presents two schemes for validating the spike sorting of time series, and illustrates them 
on ex vivo monkey retina data. We discuss the implications and conclude in Sec. |3| Some methodological 
and mathematical details are gathered in the two appendices. 
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1.1 Prior work on quality metrics 

As noted above, there has been substantial progress made over the last decade or so in developing methods 
for assessing the output of existing spike sorting methods. Broadly speaking, these consist of (a) statistical 
or information-theoretic measures of “isolation” of the clusters of firing events associated with each neuron 
and (b) physiological criteria for detecting errors such as refractory period violations. False positives are 
defined as firing events that are associated with neuron j, but should have been grouped with some other 
neuron i, with i ^ j. False negatives are defined as firing events that should have been associated with 
neuron j, but were either discarded or grouped with some other neuron. We do not seek to review these 
approaches here, and refer the reader to [iniEsiiMiEg. While the establishment of such particular metrics 
is important, those of type (a) have until this point relied on accessing internal data (eg, PCA components) 
that are particular to certain types of algorithms while excluding many other successful approaches (eg, ICA, 
model-based fitting, and optimal filters). 

Spike sorting algorithms can, of course, be run on simulated data, where ground truth is available. Thus, 
the development of more and more realistic forward models for neuronal geometry, neuronal firing, and 
instrumentation noise is an important goal [1]. Recently another approach has emerged, called “surrogate” 
spikes m or “hybrid datasets” [3T], consisting of the original voltage recordings, to which are added new 
spikes (known waveforms extracted either from different electrodes in the same recording [21) . or from a 
different recording with the same equipment EH), at known times. Running the spike sorting procedure 
again on the new dataset provides a powerful validation test, particularly since it permits testing fidelity in 
the presence of overlapping spikes, which tend to result in errors using simple clustering schemes. 

Despite all of these advances, however, most algorithms still involve manual intervention either at late 
stages of the clustering process or in quality assessment. While there have been some attempts to validate 
algorithms with human-operated steps m, we believe that limiting human intervention to an early, ex¬ 
ploratory phase, followed by a fully automated execution of the spike sorting software will be essential for 
establishing standards of reproducibility and quality, particularly with high-density multi-electrode arrays. 

The related, and broader, problem of quality metrics in clustering algorithms has received much recent 
attention [M ED [Ml [la. However, many such metrics are not directly applicable to spike sorting, since 
they require access to internal variables that may be present in some algorithms and not others, and since 
they usually seek to assess the overall quality of the clustering (eg see [3^; for an exception, see [12] )• In 
contrast, with spike sorting the concept of “overall quality” is not useful: there are often high-amplitude 
spikes that cluster well, plus a “tail” of spikes of decreasing amplitudes that cluster arbitrarily poorly, 
becoming indistinguishable from measurement noise. Thus, as realized by many researchers (eg [321 125)1. 
per-neuron metrics (as opposed to overall performance metrics) are crucial. 

1.2 Set-up of the problem and validation schemes 

We consider interfaces to two spike sorting tasks (Fig.|Tla)-(b)). The first is a simple classification of isolated 
events, which will serve to introduce ideas, whereas the second is a rather general spike sorting interface: 

1. The sorter classifies clips, ie short time-windows, each presumed to contain a single spiking event. 
We assume that the clips have already been extracted (by some detection process) and aligned (so 
that events of the same type occur at the same time within the clip). For each clip, an integer label 
is produced. More formally, let X := t=i,...,T, j=i,...,N be a three-dimensional array 

with N clips, each of duration T time samples and M channels (electrodes). The sorting algorithm S 
performs the map 

S{X) = =: k (1) 
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(c) validation scheme overview: 


(a) interface for spike sorting of clips (single spiking events): 

T 


I M 



(neuron) labels 


(b) interface for spike sorting of full time series: 
T tot 



time (samples) 


black box spike sorter 


[kj] j=i,..,N, 

labels 

{tj} j=l,..,N, 

firing times 



per-neuron metric confusion matrix 


Figure 1: Interfaces and validation of automatic spike sorting algorithms. The algorithm S to be validated 
is inside the “black box”, and is only accessible via the interface (a) or (b). In (a) the spike sorter assigns 
the jth clip a label (neural unit) kj from I to K. In (b) the spike sorter acts on a whole time series Y 
and returns a number Ng of firing events (determined by the algorithm) described by labels kj and firing 
times tj in [0, (c) Overview of the validation scheme, applying to both clip-based and time series based 

algorithms, requiring the ability to rerun the algorithm S as needed, and outputting the stability metric fk- 


assigning a label (neuron identity) kj G {1,..., K} to the jth clip. K may be a fixed internal parameter 
of the algorithm, or may be a variable to be determined by the algorithm. 

2. The sorter identifies all firing times and labels within a time-series Y = {ymt}m=i,...,M, t=o,...,Ttot-i: i® 

S{Y) = {t„k,}f^, ( 2 ) 

where Ng is the number of spikes found (determined by the algorithm), while tj G [0,Tt„t] and kj G 
{1,..., iF} are their real-valued firing times and integer labels. In contrast to the clip-based interface, 
this allows S to handle overlapping spiking events, which is crucial for complete event detection when 
firing rates are high [IZlllSllT]. 

The former interface captures the clustering task common to much software [niiiuis]- The latter en¬ 
compasses essentially all types of automatic sorting algorithm^, including those based on template matching 
[minimm], ICA [Sllig or mters 0, that have no simple detection step. Note that in neither interface are 
waveform shapes output, since these can be reconstructed from the inputs and outputs. We require that any 
algorithm parameters (thresholds, clustering settings, electrode information, etc) are fixed, and inaccessible 
via the validation interface. 

Our key output comparison tool is the following matrix (see Fig. [Dc)). Given two size-fV lists of labels, 
kj G {1,..., K}, j = 1,... ,N, and Ij G {1,..., L}, j = 1,... ,N, the confusion matrix (or contingency table 

^Probabilistic (fuzzy) algorithms that output a probability density over spike parameters such as [391 [5] could be adapted 
to our interface by selecting as output either the mode of the density, or a random sample from it. 
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Figure 2: Spike-sorting of clips extracted from a rat motor cortex recording, using PCA/k-means-|—I- (see 
section El with K = 8 and best of r = 100 repeats, (a) Average waveforms for k = 1,... ,K, (see 
®) with their populations (numbers of firings) shown below. Each of the M = 10 channels is shown 
as a separate trace. The upwards-spiking channels are a result of spatial pre-whitening (App. lA.ll) . (b) the 
labeling, each shown as a point in the feature space of the first three PCA components. Colors indicate 
label kj as in the inset, (c) 20 samples of the “rerun” stability metric of Sec. 12.21 for PCA/k-means-|--l- with 
K = 8 and r = 10. The large dot (and percentage below) shows the mean the small dots show all 

samples of and the bar shows their 25% to 75% quantiles. 


[m Ch.l7]) is a rectangular matrix Q whose k, I entry is the number of spikes labeled k in the first list and 
I in the second, ie 

Qk,i ■■= #{j : kj = k,lj = 1} . (3) 

Often the ordering of the labels is arbitrary, in which case we seek the “best” permutation of one of the sets of 
labels, in the sense of maximizing the sum of diagonal entries of the permuted Q matrix. This best-permuted 
confusion matrix Q is defined by 


min(i^,L) 

Qk,i ■■= Qk.fiii) , where tt = argmax ^ Qk,-K{k) (4) 

fe=i 

where iFl is the set of permutations of L elements. Large diagonal entries mean that the two sets of labels 
are consistent; off-diagonal entries indicate the amount and type of variation between the two labelings. 
The assignment problem of finding tt can be solved in polynomial time by applying Kuhn’s “Hungarian 
algorithm” [15] to —Q. When firing times are also present, we will need an extended confusion matrix, 
which we postpone until section [Sj 

2 Clip-based validation schemes 

In this section we present some clip-based validation schemes, proceeding from simple to more sophisticated. 
Since there is no ground truth data, we must focus on the idea of stability. We make the point that if the 
labeling of a spike is unstable (with respect to resampling or perturbations consistent with the noise), then 
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it is almost certainly not accurate. In other words, stability provides an upper bound on accuracy. We 
illustrate this on a set of 6901 upsampled and peak-aligned clips extracted from an in vivo rat motor cortex 
recording, as described in Appendix lA.ll 

2.1 A standard clip-based spike sorting algorithm 

Since our goal is to validate existing algorithms rather than present new ones, we use as our default a 
standard spike-sorting algorithm that we refer to as PCA/k-means-|—h, as follows. Clips are organized 
into a matrix A G such that each column contains the time-series for all channels for one clip. 

Dimension reduction (from dimension MT = 1470 to dimension = 10 by default) is then done by 
replacing each column by its first iV^a PCA component^. Then, treating each column (clip) as a point in 
l^Wfea, k-means-|--|- [1] is used for clustering, which requires a user-specified K. We remind the reader that 
k-means-|--|- is a variant of k-means using a certain favorable random initialization of the centroids. The 
converged clustering produced often depends on initialization, ie the k-means iteration is not often able to 
find the global minimum of the sum of squared distances from points to their assigned centroids (this is in 
general AP-hard m)- Thus, as is standard, we repeat k-meansH—h r times and choose the repeat with the 
minimum sum of squared distances. The result is a label kj for each clip. The labels are permuted so that 
the Z 2 -norms of the estimated spike waveforms, ie ^ decrease with increasing k. For this the 

estimated spike waveforms are found by simple averaging over all clips which have been assigned the 

same label k. More formally, 

= — E = k = l,...,K (5) 

j:kj=k 

where 

Uk = n/c(k) := #{j : kj = k} (6) 

is the fcth population in the labeling k. 

Fig. [IKa)-(b) shows the result of this sorting algorithm, for K = 8 neurons and r = 100 repeats, running 
on the clip dataset. Note that we do not claim that this is an optimal algorithm; our goal is merely to use 
it to illustrate the presented validation schemes. 


2.2 A stability metric based on rerunning 

Our goal is to validate a spike sorting algorithm acting on a particular dataset of clips. If the spike sorter 
is non-deterministic, then comparing two runs on exactly the same clip data (with different random seeds) 
gives a baseline measure of reproduceability for each label. Let {kj} = k and {Ij} = 1 be the sets of labels 
returned from two such runs, and assume that they have the same number of labels K. Let Qk,i be the 
elements of their best-permuted confusion matrix ([4]). Then define for each label k the stability^ 


fk ■■= 


^Qk,k 

nfc(k) -I- nfc(l) 


k = l,...,K. 


(7) 


^ Numerically, A is replaced by the first A7ea columns of where V is the matrix whose columns are the eigenvectors 

of AA'^ ordered by descending eigenvalue. 

®This is similar to the “F-measure” of a clustering 1411 Eq. (17.1)], or harmonic mean of the precision and recall, relative to 
a ground truth clustering, with the difference that a single matching is enforced between the two sets of labels. 
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Figure 3: The “rerun” stability metric for PCA/k-means++ with K = 8 fixed, for various r (best of r 
repeats of k-means++). (a) r = 1, (b) r = 20, (c) r = 100. Plots should be interpreted as in Fig. [2Kc). This 
shows convergence of all label stabilities to 1 (hence an uninformative metric) as r increases. 


Samples of fk are now computed by performing this process for many independent pairs of runs, with the 
same K, and the mean fk is reported. We use the notation fk^"" to indicate the “rerun” stability metric. 
Note that the pairs of runs used to sample the metric should not be confused with the r “repeats” that are 
internal to this particular sorting algorithm S. 

We illustrate this for our default sorting algorithm (using the best of r = 10 k-means++ repeats) in 
Fig. EJc), which shows 20 independent samples of with quantiles and means. This shows that the 
first four spike types are stable (as expected from the large waveform norms in panel (a) and well-separated 
clusters in (b)), while the last four are less so. The last {k = 8) is the least stable, and from its tiny waveform 
it might be judged not to represent a single neural unit. 

Remark 1. We argue that 20 samples of any metric fk are more than sufficient for validation, beeause the 
resulting standard deviation of the estimator of the mean is then somewhat smaller than the width of the 
underlying distribution of fk- If a neuron has a wide distribution of fk, it is unstable, and high accuracy in 
estimating fk is not needed. This follows wisdom from the Monte Carlo literature \20[ Sec. 7.5]. 

However, as Fig.[3]shows, as a larger number of repeats r is used, the stability for each label k = 1,... ,K 
grows. By r = 100 all stabilities have converged to very close to 1, ie there is very little variation in 
labeling due to random initialization. This is because picking the best of many repeats finds a sum of 
squares distances to centroids that is closer to the global minimum. The algorithm has effectively become 
deterministic, hence the rerun metric is uninformative. Since we wish to validate deterministic spike sorting 
methods (eg, ones based on hierarchical or density-based clustering [HI Ch. 14-15]), it is clear that one must 
introduce variation into the input X. We now turn to such methods. 

2.3 A cross-validation metric using subsampling 

In machine learning it is common to use cross-validation (CV) to assess an algorithm. The simplest version 
splits the data into training and test subsets, and uses the performance on the test set to estimate the 
generalizability of a classifier built using the training set. Although CV requires an observed dependent 
variable, a similar idea can be applied to spike sorting evaluation even in the absence of ground truth 
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Figure 4: Three-way cross-validation stability metric for PCA/k-means-|--|- (best of r = 100 repeats), (a) 
Stabilities for K = 8, and populations from a single run of the sorter, (b) The two PCA components of the 
“toy” data (extracted from the labels k < 3) with N = 1584, labeled with PCA/k-means-|--|- with K = i 
and = 2. (c) Estimated waveforms from the run shown in (b). (d) Stability metrics for the K = A labels 
on the data from (b). (e) Same for an expanded dataset with a 100 times larger N. 


information. We divide the N clips randomly into three similar-sized subsets 1, 11, and 111, and train a 
classifier Ci with only the data 1, and a classifier Cu with only the data 11. Finally we compute the best- 
permuted confusion matrix (|4]) between the labels produced by classifying (labeling) the data 111 using Cj 
and using Cu, then from this compute stabilities for each spike type, as before using 0. Other more 
elaborate variants involving m-fold CV are clearly possible; we do not test them here. 

We now describe the classifier. Our clip-based interface Fig. UKa) does not include a classifier, yet we can 
build a simple one by taking the label of the mean waveform that is closest (in some metric) to each 
clip in question. That is, given a set of clips X which gives spike-sorted labels k = S{X), the estimated 
waveforms k = 1,..., AT are computed via they form the information needed for the classifier. 

Applying the classifier to new clips X' := gives new labels according to 

fc' = argmin , 

where for simplicity the I 2 norm has been chosen as the distance metric0. Finally, since the waveform norm 
ordering of Sec. 12.11 may vary for independent subsamplings, then in order to collect meaningful per-label 
statistics of we must permute the labelings from later repetitions to best match the first. For this we 
apply the Hungarian algorithm to the matrix of squared distances (Z 2 -norms) between all pairs of mean 
waveforms in the repetitions in question. 

Fig. mja) shows this stability metric (for 20 independent 3-way splits) for the PCA/k-means-|--|- spike 
sorter with K = 8 and r = 100. Notice that the last four spike types have stabilities in the range 70- 
90%, in contrast to Fig. me) for which they were all 100%. This shows that the three-way data subsampling 
introduces variation that detects unstable label assignments, even for an essentially deterministic spike sorter. 

We now show that the 3-way CV metric is able to identify an erroneously split cluster. For this we 
generate a simple “toy” dataset comprising just the N = 1584 clips with labels kj = 1, 2 or 3 from the 


“^This is appropriate for an iid Gaussian noise model; a more realistic noise model is described in App. Also see Eg. 























default K = 8 and r = 100 spike sorting (Fig. [2]). Thus there are three very well-separated true clusters 
in feature space with a clear ground-truth; see Fig. SKb). For this toy dataset we pick a standard sorting 
algorithm of PCA/k-means-|--|- with K = A (chosen to be erroneously large), r = 100, and = 2 (to aid 
visualization). Running this algorithm on the toy data gives the labeling in two dimensions of feature space 
shown in Fig.|3Kb), and waveforms of Fig. |3Kc): the largest true cluster is erroneously split into two nearly- 
equal pieces, which are given new labels k = 2,3 (hence the similar pair of waveforms in (c)). There are many 
ways to summarize accuracjH here [411 Sec. 17.3.1]: filling the confusion matrix ([4]) of size 3x4 between 
truth and the K = A clustering gives the F-measures © for the three true labels as /2 = 0.701, f\ = /a = 1. 
Running the CV metric on this K = A sorting algorithm results in the stabilities in Fig.0Kd): a mean around 
87% for k = 2,3 (somewhat higher than measures of accuracy), and 100% for the other (unsplit) clusters. 
The point of our framework is that this conclusion is reached via only the standard interface S, ie without 
access to the feature vectors shown in (b), nor of course the ground truth. 

However, the variation in labeling induced by subsampling is small compared to the known accuracy 
(one reason being that the small TVfea makes for high stability), and furthermore dies away as the number 
of clips grows, N —>■ oo. We illustrate this in Fig. HJe), which shows the 3-way CV metric for the same 
algorithm as (d) but running on a set of clips 100 times larger, generated as described in App. IA.2I The 
stabilities of the k = 2,3 labels of the split cluster are now around 98.5%, hardly an indication of instability. 
This is consistent with a central limit theorem proven [33] for stability metrics in a wide class of clustering 
algorithms, when their global optimum is unique, implying in our setting that 

1 — for each k . 

Indeed, increasing V by a factor of 100 decreased the deviation of fk from unity by around a factor of 
10. We expect the prefactor for this deviation to be smaller when the split cluster is more non-spherical: 
resampling induces high instability in splitting a spherical cluster, but in k-means a non-spherical cluster (as 
in Fig.|D]b)) tends to split consistently along a hyperplane perpendicular to its longest axis. 

This unfortunate phenomenon that, regardless of the apparent quality of the labeling, stability tends to 
unity as N grows has been known in the clustering literature for some time [14] . Renormalizing a global 
stability metric by \/]V is possible [36l Sec. 3.1.1], but it is not clear how to do this when a per-neuron 
metric is needed. On the other hand, restricting to small-V subsamples to induce instability has the major 
disadvantage that the spike sorter is then never validated with any dataset of the desired size. 

Since we seek a scheme which can validate spike sorting applied to arbitrarily large datasets, we must 
move beyond subsampling, and explore perturbing the clips themselves. 

2.4 Metrics based on data perturbations without subsampling 

Fig. Illb)-(e) shows that for large N, stability under subsampling fails to indicate a highly erroneous split 
cluster. By examining feature space (Fig.HIb)) it is clear that a decision boundary of k-means passes through 
a high-density region. Can one create a metric that quantifies this undesirable property without access to 
feature space, using only the standard interface? We now present two such metrics, both of which involve 
perturbing the data. 

2.4.1 Self-blurring 

Firstly we describe a method that we name self-blurring for reasons that will become clear. Let k = S{X) 
be the output of the sorter on the full set of clips, and {W^'^^}k=i,...,K be the mean waveforms found via 

®A11 four clusters have precision 1, but fe = 2,3 have recall around 0.5. 
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(a) original toy data, K=4 (b) self-blurred, y=1 (c) noise reversed 



Figure 5: Data perturbation validation metrics illustrated on the “toy” clips (from labels k < 3), sorted with 
PCA/k-means-|--|- with iF = 4, r = 100, iVfea = 2. (a)-(c) shows the 2D PCA feature space, (a) Original 
points labeled by color as in Fig.|4Kb), showing k-means centroids and decision boundaries, (b) Points after 
self-blurring and (c) after noise reversal; colors show the original labelings, to highlight movement, and new 
decision boundaries, (d) Idealized ID picture: only the shaded parts of the distribution change labeling. 
Stability metrics are shown for (e) self-blurring and (f) noise reversal, for the original toy data size and 
a 100 times larger N. 
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Let TT be a permutation of the N clips that randomizes only within each label class. Precisely, if 
Jfe := {j : kj = k} is the index set of the /cth label type, and tt^ is a random permutation of Uk elements, 
then the overall permutation tt is defined by 

= Jfe( 7 rfe(i)), i = l,...,nk 

holding for each label type k = 1,..., K. From this we create a perturbed clip dataset X with elements 

Xmtj = Xmtj + liXjnt.nU) “ m = 1, . . . , M, t = 1, . . . , T, j = 1, . . . , ( 8 ) 

where 7 > 0 is a parameter controlling the size of the perturbation. Running the sorter on this new data 
gives k = S{X), then we compute the best-permuted confusion matrix (|3]) between labels k and k, and from 
it obtain the per-label stabilities /“"■■ via ©• 

The bracketed expression in (|S]) can be interpreted as a random sample from the “noise distribution” 
of the kjth. cluster in clip signal space, relative to its mean waveform (centroid). So, with 7 = 1 , the 
cluster distribution about its centroid is convolved (blurred) with itself. The result is that, even if a decision 
boundary remains fixed, if there is density near this boundary then points are carried across it by the 
convolution, thus change label, and are registered as instability. The new data X is consistent with the 
original X in the sense that it differs only by additive noise of exactly the type observed within each cluster. 
No assumption on the form of the noise model or noise level is made. 

Since (signal space) is hard to visualize, we again “peek under the hood” of our PCA/k-means-|--|- 
algorithm, by in Fig.jSJb) showing the result of self-blurring on the 2D feature vector points. Notice that there 
is significant overlap of red and green points, and rotation of one of the decision lines. The self-convolution 
also results in larger clusters, which may cause label exchange between clusters that were distinct (eg see 
the tails of the lower two clusters). For an isolated Gaussian cluster, its size (noise level) grows by a factor 
V 2 . Fig. [SJe) shows the distribution of for 20 samples of X. The mean around 75% for k = 2,3, being 
similar to the actual F-measure accuracy of 70%, is a clear indicator of instability. For the dataset of size 
lOOA^ this also holds; thus, we have cured the large-iV vanishing of instability present with CV. 

We show an idealized example where points are drawn from a symmetric distribution p(z) in a ID 
variable z, erroneously split into two clusters, in the top two pictures in Fig. EDd). The decision boundary 
is unchanged by self-blurring. The mass of each half of the distribution that changes label is shaded; this 
controls the size of 1 — /i'"'. If this distribution were a Gaussian, and 7 = 1, we can solve analytically (see 
App. |B]) that = 1 — [erf((27r)“^/^)]^ Ri 0.817. The same holds for a multivariate Gaussian in signal 
space split along a plane of symmetry. The reported stabilities in Fig. [51(e) also are similar to this value, 
even though clusters are not expected to be Gaussian [S]. 

Later we will vary 7 : a smaller 7 tends to increase all reported stabilities, but also causes less inflation 
of clusters (ie less increase in noise level), which may better reflect an algorithm’s stability for close, but 
distinct, clusters. 

2.4.2 Noise reversal 

Our second perturbation method has a similar flavor: using the above idea of noise (deviation) relative to a 
cluster centroid, we simply negate the noise. That is, we replace ® by 

Xmtj = - Xmtj , m = l,...,M, t=l,...,T, j = 1,...,N, (9) 

and proceed exactly as above to get Since this is a deterministic procedure, there is no need to resample; 
only two spike sorter runs are needed (the original and the noise-reversed). This is shown in Fig.|5l(c), and 
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Figure 6 : Perturbation metrics for default clips dataset with PCA/k-means++ with r = 100. (a) self¬ 
blurring ( 7 = 1 , shown in black) and noise reversal (shown in orange) metrics for K = 8. (b) self-blurring 
(7 = 0.5) metric for various iF = 2,..., 12 (each row a different K). (c) same for noise reversal metric 
(d) mean waveforms for K = 10, and (e) self-blurring scaled best confusion matrix 


lower picture of (d): each cluster is inverted about its centroid. Label exchange occurs in a split cluster 
because the tails of the distributions are made to point inwards, and bleed across the boundary. The results 
for this toy dataset appear in Fig. [SKe): the erroneously-split labeling induces around 89% stability for both 
the original iV and a 100 times larger N. This is more stable than both self-blurring and the true accuracy—a 
disadvantage—however its advantages are that the clusters are not artificially enlarged, that there are no 
free parameters, and no resampling. 

In the idealized ID Gaussian case, and the multivariate Gaussian split along a symmetry plane, analyti¬ 
cally (see App. ini) we get = erf(2/y^) k, 0.889, which is very close to the observed values, and explains 
why stabilities are expected to be closer to 1 than for self-blurring with 7 = 1 . 

2.4.3 Perturbation metric tests for the default algorithm 

We return to the full clips data from Sec. I2.1l with PCA/k-means-|--|- with the default Wea = 10 and r = 100. 
Fig. EKa) compares the self-blurring and noise reversal metrics for K = 8: we see that noise reversal has all 
stabilities roughly 3 times closer to unity than self-blurring at 7 = 1, but otherwise similar relative stabilities 
vs neuron label k. 

Since we cannot presume that if = 8 is optimal for extracting the largest number of stable neurons, in 
Fig. [Hi;b)-(c) we show the mean stabilities varying AT = 2,..., 12. Both show that the first four neurons are 
stable for most K values, and show that one or two others are also quite stable for AT > 8 (these are neurons 
fc = 7, 8 for AT > 10). Notice that, for noise reversal, no stabilities drop below 0.75, ie the stability range is 
compressed towards unity. 

We study AT = 10 in more detail, since for it, neuron fc = 5 shows a self-blurring stability close to 
zero. Fig. EKd) shows the mean waveforms for this AT; indeed /c = 4, 5 have very similar waveforms so 
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would probably be deemed erroneously split in practice. The (best-permuted) confusion matrix Q is a useful 
diagnostic: as the vertical “domino” in Fig. Ele) shows, k = 4 , and 5 have been merged by self-blurring. 
Off-diagonal entries such as Qe.g show that A: = 6 ,9 are prone to mixing, which is borne out by their visually 
similar waveforms. The point is that our metrics quantify such judgments, without a human having to 
examine whether waveforms are “close” given the particular noise (and hring variation) of the experiment. 

Finally, we observe that for K = 4, all supposed neurons are very stable (under both metrics), even 
though, with our knowledge of the larger number of distinct waveforms extracted for greater K, it is obviously 
not accurate. This illustrates that stability is necessary but not sujficient for accuracy. 

Remark 2 . Although we illustrated the self-blurring and noise reversal metrics with a label assignment 
(clustering) scheme involving hard decision boundaries (hyperplanes for k-means), we propose a broader view: 
since both metrics involve perturbing the clip data in a manner consistent with the noise, any reproducible 
neural unit output by a spike sorting algorithm should be stable under these metrics. 

3 Time-series based validation schemes 

We now turn to the second, more general, interface of Sec. 11.21 where the spike sorter extracts both times tj 
and labels kj from a time series Y of multi-channel recorded data. To compare two outputs of such a sorter, 

say {tj,kj}^f^ and where kj G {1, ■. ■ ,K}, k) G {1, ■. ■ ,K'}, and K and K' are the two (not 

necessarily equal) numbers of types of neurons found, one must match spikes from the first run to those in 
the second. A simple criterion is that matched firing times be within e of each other; we fix e = 0.5 ms in our 
tests. Specifically, let a matching ^ be a one-to-one majl^ from a subset of {1,... ,Ns} to an (equal-sized) 
subset of { 1 ,..., V'} such that 

\^j ~ ^ I fo'' j matched in the map. ( 10 ) 

Given any such pi, the confusion matrix is defined by 

Qk% = : mO) exists, kj = k, = k'} , ioi 1 < k < K, 1 < k' < K' . (11) 

Since there may be unmatched spikes j in the first run (for which we write p{j) = 0 ), and unmatched spikes 
j' in the second run (for which we write = 0), a row and column is appended for these cases to 

ie, 

Qi% = #0' : Kj) = 0, kj = k} , for l<k< K , (12) 

QSI = ff{j.ti-\j') = 0, k),=k'} , forl<k'<K', (13) 

and set ^ = 0. The {K -f l)-by-(Ar' -|- 1 ) matrix defined by (fTT]) - (fT^ we call the extended confusion 
matrix; see Fig. [71a). 

Since a sorting algorithm may arbitrarily permute the neuron labels, as before we need to hnd the “best” 
confusion matrix. However, now one must search over permutations tt of the second label and over matchings 
pL, ie find 

min(K,K') 

Qk,k' ■■= , where {A,^}= argmax V , (14) 

can also be expressed as a subgraph of the complete bipartite graph between Ns nodes and N'^ nodes, whose vertex 
degrees are at most one. 
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where A4 is the set of all maps /i satisfying (ITU)) . In practice we find that a simple three-pass algorithm 
performs well at typical firing rates (optimizing tt using a greedy-in-time-difference algorithm for matching, 
then with tt fixed, optimizing ^ via two passes of greedy matching). We do not know the complexity of 
finding the exact solution. 

3.1 A simple time-series spike sorting algorithm 

As an illustrative spike sorter we will use the following model-based fitting algorithm which handles over¬ 
lapping spikes (the full description we leave for a future publication). We assume that the time series Y has 
been low-pass filtered. Firstly, upsampled clips are extracted from Y using the triggering and upsampling 
procedure of App. lA.ll with a —100 fj,V threshold. Secondly, these clips are fed into the clip-based spike 
sorter of Sec. O using the best of r = 100 repetitions of k-meansH—h, with a preset number of clusters 
K. Thirdly, a set of upsampled mean waveform shapes is found by applying (O to the resulting 

clips and labels. These waveforms are ordered by descending Z 2 -norm. Finally, these waveforms are matched 
to the complete time series Y via a greedy fitting algorithm EH [25]. Specifically, the change in l 2 -norm 
of the residual due to subtraction of each (downsampled, fixed-amplitude) waveform at each time-shift was 
computed, and firing events tj were declared at sufficiently negative local minima (wth respect to time shift) 
of this function, that were also global minima over all waveforms with time shifts up to ±0.5 ms from tj. 
The corresponding kj was the label of the waveform responsible for this largest decrease in residual. This 
greedy sweep through the time series was repeated until no further change occurred. Note that, as with 
many modern algorithms [34l [9] [7| [12 [25] the main fitting stage involves no concept of “clips” or event 
windows. 

As in Sec. 12.21 a simple rerunning validation metric would be possible for time series. However, we present 
the following two metrics as much more realistic indicators of stability. 

3.2 Noise-reversal 

The noise-reversal idea of Sec. 12.4.21 generalizes to time series, handling overlapping spikes naturally, as 
follows. Firstly, the sorter runs on the unperturbed data to give S(Y) = {tj Since the interface does 

not output waveforms, they are next estimated by averaging windows from Y centered on the output firing 
times tj for the appropriate label, eg 

= — X! ym,tj+t-T/2 , 771 = 1, . . . , M, f = 1, . . . , T . (15) 

Tlk , , 

j-.kj=k 

The “forward model” generating a (noise-free) time series from a new set of firing times s := and 

labels 1 := is then F(s,l), with entries 

N 

F^t = F{s, lUt := ^ > m = 1,..., M, t = 1,..., . (16) 

1=1 

(We in fact use variants of ()15l) " ()16|l which achieve sub-sample firing time accuracy by upsampling the 
waveforms as in App. lA.H and building downsampling into F.) Using this, a perturbed time series 
(analogous to (O) is generated, 

f = 2F(t,k)-r, (17) 

^In the case of high firing rates, a least-squares solution of a large linear system may provide an improved estimation [25l[7]. 
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Figure 7: Time-series stability metric results, (a) Toy example {K = 2) of a matching /r between one spike 
sorter output tj , kj and another t',, /c', (blue shows label k = 1 and red label k = 2). The resulting extended 
confusion matrix and its “best” permutation Q are shown. The matching ^ shown is the best matching /t 
even though the nearby pair is swapped in time, (b)-(f) show results for the spike sorter of Sec. 13.11 with 
K = 10. (b) Waveform shapes {V'^^kk=i populations from the unperturbed run, and (c) from the 
noise-reversed run. (d) Input time series data Y, and noise-reversed time series Y, showing firing times and 
(best-permuted) labels from each run. (e) Stability metrics fk for noise-reversal (shown in orange), and for 
the spike addition metric (black). (f) Spike firing cross-correlations of spike sorter output tj , kj. Refractory 
dips (of at least ±3 ms) are complete for fc = 1,..., 5 and partial for fc = 6,10, matching the metrics in (e). 
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and sorted to give S{Y) = jt' /, Finally, the best extended confusion matrix Q between the two sets 

of outputs is found as in (fTTli -(lT ^ . and the usual stability metric fk computed via 0. 

Fig. [7] illustrates this metric for spike sorting filtered time series data Y taken from an ex vivo retinal 
multi-electrode array recording described in App. IA.31 with AT = 10 neurons assumed. This dataset has a 
large fraction of overlapping events: spike sorting as in Sec. 13.II results in around 10% of firing events falling 
within 0.1 ms of another event. The upper half of Fig. [7l^d) shows a short time window from Y, while the 
lower half shows the resulting Y. Note that some of the sorted times and labels have changed, while between 
firing events the signal is negated. The orange dots in Fig. [7l[e) show the resulting values: the five 
neurons with largest waveform norms are quite stable, while k = 6 and 10 are marginally stable (recalling 
from Sec. 12.4.21 that 90% is hardly a stable noise-reversal metric), and fc = 7, 8,9 are unstable and—^judging 
by their huge Uk —massively overfit. This matches well the per-neuron physiological validation based on 
clarity of refractory dips in the firing time auto-correlations in Fig. [7Kf). 

It is worth examining why k = 9 has almost zero stability. The cause is apparent in panels (b)-(c), which 
show the waveforms estimated via (I15I) from the unperturbed and noise-reversed runs (the latter in 
“best permuted” ordering tt). The two sets of waveforms are very similar apart from the possibly distinct 
neuron A: = 9 in (b) which has disappeared in (c) due to the splitting of fc = 7. We find that such clustering 
instabilities are common, and vary even with rerunning the sorter with a different random seed in k-meansH—h. 
Our metric quantifies these instabilities as induced by variations consistent with any even-symmetric noise 
model. 

3.3 Auxiliary spike addition 

The final stability metric we present perturbs the time series by linearly adding new spikes at known random 
firing times and then measuring their accuracy upon sorting; this gauges performance in the context of 
overlapping spikes—by creating and assessing new overlaps—while respecting the linearity of the physical 
model for extracellular signals [3]. 

Let the first uperturbed sorter run produce times and labels S{Y) = {tj,kj}^li, from which a forward 
model is built, as above via (HSli-dlll). An estimate for the firing rate of the fcth neuron is nu/T^of, where Uk 
is given by ([H]). One generates an auxiliary set of events {sj, as the union of K independent random 

Poissonian spike trains, with rates Puk/T^^^, k = 1,..., K, where /3 is a rate scaling parameter. If /3 is too 
small, not much is learned from the perturbation; if too large, the total firing rates are made unrealistically 
large. We fix /3 = 0.25. The perturbed time series is then 

y = r + F(s,i), (18) 

which is sorted to give S(Y) = {t'/, fc',}^h]^. The best extended confusion matrix Q is now found (as in 
(EID-iIIl) between the total of Ng + Na events t U s, k U 1, and the second run outputs t', k'. To assess the 
ehange in Q induced by adding spikes, we compute the matrix 

gadd _ Q-diag{ni,...,nif,0} 

where diag denotes the diagonal matrix with diagonal elements as listed. Finally, is computed as in 
© using One may interpret this diagonal subtraction as follows: if all the new spikes were correctly 

sorted, and all the original spikes sorted as in the first run, would be diagonal with entries given by the 
added populations for each label, so = 1 for all fc. Thus low values indicate either lack of sorting 
accuracy for the added spikes, or induced instability (eg false positives or negatives) in re-sorting the original 
spikes. Both of these are important warning signs that a putative neuron is less believable. 
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Interface 

Metric 

Section 

Advantages 

Disadvantages 

clip based 

rerun 

3-way CV 
self-blurring 
noise reversal 

Sec. 

Sec. 

Sec. 

Sec. 

2.21 

O 

2X1 

2.4.2 


simplicity 

simplicity 

adjustable parameter 7 

only needs 2 runs; no extra noise 

useless if sorter is deterministic 

classifier needed; not useful for large N 
raises noise level (can underestimate accuracy) 
stability range close to 1 

time series 

noise reversal 

spike addition 

Sec. 

Sec. 

3.21 

oi 

only needs 2 runs; no extra noise 

accuracy in context; no extra noise 

stability range close to 1 ; reverses noise polarity 

firing rate slightly higher; no built-in shape variation 


Table 1: Summary of the proposed validation metrics for clip-based and time series spike sorting algorithms. 


Remark 3. Marre et al. m and Rossant et al. m use similar but distinct validation metrics: they add 
spikes with known firing times from either a spatially translated neuron, or a “donor” neuron from a different 
recording, then assess accuracy for this neuron. While useful as overall measures of reliability, these methods 
reguire dataset-specific adaptations, and it is not clear that they validate any of the particular neurons in 
the original data. In contrast, the metric we propose here validates each neuron as sorted in the dataset of 
interest. 

Fig.m e) shows 20 independent samples of f^'^ (ie, 20 realizations of the Poisson spike trains), with 
the black dots showing the means. Only the hrst five neurons have stabilities above 60%. The overall 
pattern is similar to noise-reversal, but with a lower overall calibration of the metric, and it matches well the 
independent validation via cross-correlations in Fig.[7l)f). Note that if adding a certain type of spike induces 
changes in the sorting of the (more numerous) existing spikes, then huge drops in stability, including highly 
negative are possible, as visible in Fig. [TKe), k = 8,9. Indeed, from its waveform and huge population 
in Fig. [7Kb), k = 8 would likely be classified as the “noise cluster” (see eg [38]) by an experienced human. 

4 Discussion and conclusions 

We have proposed several schemes for the validation of automatic spike sorting algorithms, each of which 
outputs a per-neuron stability metric in the context of the dataset of interest, without ground truth infor¬ 
mation or physiological criteria, and with few assumptions about the noise model. Our key contribution is 
a common framework (Fig. [1]) where this validation occurs only through a simple standardized interface to 
the sorter, which is treated as a black box that can be invoked multiple times. We see this as essential to 
automated validation and benchmarking of a growing variety of spike sorting algorithms and codes. By con¬ 
trast, most existing validation metrics rely, in part, on accessing internal parameters peculiar only to certain 
algorithms. We also envisage such metrics as the first filter in a laboratory pipeline in which—due to the 
increasing scale of multi-electrode recordings—it will be impossible for a human operator to make decisions 
about most neurons. While stability cannot guarantee accuracy or single-unit activity (eg see Sec. I2.4.3|) . 
instability is a useful warning of likely inaccuracy or multi-unit activity. Only neurons that pass the stability 
test should then be further validated based on eg shape [23 1 refractory gaps [T3] and/or receptive fields 

[MlITlIll]. 

Table [1] summarizes the six schemes: the first four use a simple sorting interface that classifies “clips” 
(spiking events), while the remaining two use a more general interface that accesses the full time series 
and allows handling of overlapping spikes. All schemes resample or perturb the data in a realistic way, 
then measure how close the resulting neuron-to-neuron best confusion matrix Q is to being diagonal; its off- 
diagonal structure also indicates which neurons are coupled by instabilities. Although our demonstrations 
used standard sorting algorithms with a specified number of neurons K, the metrics also apply to variable 
K with minor extra bookkeeping. 
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The first two schemes (rerunning and cross-validation) are conceptually simple but are shown to have 
severe limitations. The next two (self-blurring and noise reversal for clip sorting) seem useful even in large- 
scale settings. However, we expect that the last two (noise reversal and spike addition for time series) will be 
the most useful. Each metric carries its own calibration: for instance, with the sorter fixed, for noise reversal 
fk is much closer to 1 than for spike addition. A future task is to calibrate the metrics against accuracy with 
both synthetic and ground-truth data. 

Although general, our schemes make certain assumptions. Noise reversal assumes sign-symmetry in the 
noise distribution (which may not be satisfied for “noise” due to distant spikes), whereas spike addition as 
presented does not account for realistic firing variation in the added spikes; the latter could be achieved in 
a model-free way by adding random spike instances (as in self-blurring) instead of mean waveforms. 

We believe that several other extensions are worth exploring. 1) The last two validation schemes essen¬ 
tially involve reverse-engineering a forward model (HU) from a single spike sorter run. This model could be 
improved in various ways, by making use of firing amplitude information (if provided by the sorter), or, in 
nonstationary settings (waveform drift), constructing waveform shapes from only a local moving window. 
2) The random known times in spike addition could be chosen to respect the refractory periods of the ex¬ 
isting spikes, avoiding rare but unrealistic self-overlaps. Added spikes could also be chosen in a bursting 
sequence that tests the sorter’s accuracy in this difficult setting. 3) A variant of self-blurring that estimates 
' 9 /fe*“V' 57 l 7 =o could avoid the problem of noise level increase (cluster inflation) at larger 7 . 

Code which implements the validation metrics and sorting algorithms used is freely available (in MATLAB 
with a MEX interface to C) at the following URL: 

https://github.com/ahbarnett/validspike 
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A Provenance of test datasets 

A.l Clips dataset 

We started with an electrical recording, supplied to us by the Buzsaki Lab at NYU Medical Center, taken 
from the motor cortex of a freely behaving rat at 20 kHz sampling rate with a NeuroNexus “Buzsaki64sp” 
probe m- We took 3 X 10® consecute time samples from a single shank with M = 10 channels. The following 
procedure to extract clips combines standard techniques in the literature. 

We first high-pass filtered at 300 Hz, using the EFT and a frequency filter function a(/) = (1 -I- tanh[(/ — 
300)/100])/2, where / > 0 is the frequency in Hz. This gives a M-by-3 x 10® matrix Y, which was then 
spatially prewhitened by replacing by QY, where Q is R~^ but with each row normalized to unit I 2 norm. 

R = C is the Cholesky factorization of C, an estimate of the M xM channel-wise cross-correlation matrix. 
C itself was estimated using 10'^ randomly-chosen “noise clips” (time-series segments of duration 4 ms for 
which no channel exceeded a threshold of 158 /rV). This prewhitening significantly improved the noise level, 
and helped remove common-mode events. Events were triggered by the minimum across channels passing 
below —120 /iV (around 4 times the estimated noise standard deviation); events where this trigger lasted 


18 


longer than 1 ms, or where another trigger occurred within 2 ms either side of the triggering event, were 
discarded as unlikely to be due to a single spiking event. This gave 6901 clips of length 60 samples each. 

Clips were then upsampled by a factor of 3, using interpolation from the sampling grid by the Hann- 
windowed sine kernel [5] of width r = 5 samples, 

r 1, t = o, 

f{t) = 0, \t\ > T, 

[ 2 Mcos 2 (fA), otherwise, 

where t is given in the original sample units. The kernel width reduced the clip duration from 3 ms to 2.45 
ms. Finally, the upsampled clips were aligned by time translation by an integer number of upsampled grid 
points until the minimum across channels lies at the central grid point. When translated, values either side 
of the data were padded with zeros (on average less than one padded zero per clip per channel was needed). 
The upsampling allows alignment to be more accurate and improves cluster quality [2]. The result was a 
M-hy-T-hy-N array of clips X with M = 10, T = 147, and N = 6901. 

A.2 Larger clips dataset 

In Sections 12.31 and 12.41 datasets comprising a larger number of clips N were needed. Since we did not have 
stationary time series containing such large numbers of clips, we generated them from the above clips as 
follows. Let F be the factor to grow the number of clips. Each clip was duplicated F times, then to each 
channel of each new clip was added time-correlated Gaussian noise with variance where ry = 30 /rV is 
an estimate of the noise standard deviation from pre-whitened “noise clips”, as above. The autocorrelation 
function in time was taken as C{t) = with r = 0.5 ms, giving a rough approximation to the observed 

noise autocorrelation. As is standard, the time-correlated noise samples were generated by applying the 
Cholesky factor of the time-sample autocorrelation matrix to an iid Gaussian signal of the desired length. 

A.3 Time-series dataset 

Our time-series data Y comes from M = 7 adjacent electrodes (spatially forming a hexagon around a 
central electrode) within a 512-electrode array [H] recording at a 20 kHz sample rate from the spontaneous 
activation of an ex vivo monkey retina. The recording length was 2 minutes (Tt^t = 2.4 x 10® samples), and 
mean neuron spiking rates are of order 20 Hz. This was supplied to us by the Ghichilnisky Lab at Stanford. 
The raw data was then high-pass filtered at 300 Hz as in App. lA.ll Since there was little noise correlation 
between channels, no spatial prewhitening was done. 

B Analytic stability for a Gaussian erroneously split in two 

When the N clips are drawn from some underlying probability distribution function (pdf) p in signal or 
feature space, we have analytic results as A^ —>■ oo for the stability of a single cluster when split into two 
symmetrically by a decision hyperplane (eg by k-means with K = 2). Gonsider as in Fig. [lid), the ID case 
z G M, with p{z) symmetric about the splitting point z = 0 (without loss of generality). The centroid of 
the positive half of the pdf is c = 2 zp{z)dz. The stability (of either of the two labels) is the mass that 
does not change label, which, specializing to 7 = 1 , is the chance that the sum of two independent positive 
samples from p exceeds c, 

/)>'“'= 1 - 4 / f p{x)p{y)dxdy . (19) 
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Choosing a Gaussian p{z) = (l/v^)e“^^ for which c = the second term in (ITOl) is the mass in a 

diamond region which, by the rotational invariance of p{x)p{y), may be rotated by 7 r /4 to become the square 
|a:|, \y\ < cl^/2. Thus 


fT' = 1 - 


rc/\pl - 2 

2 / p{x)dx 


1 - [erf(l/V^)]2 =0.817..., 


L ^0 J 

where the usual error function is defined in [531 (7-2.1)]. 

For noise reversal, stability is simply the probability that a single sample has \z\ < 2c, 

= erf(2c) = erf(2/v^) « 0.889 ... 

Both of these cases apply to a multivariate Gaussian split on a symmetry hyperplane, by taking z as the 
coordinate normal to the hyperplane. 


References 

[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In In Proceedings of the 
18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007. 

[2] T. J. Blanche and N. V. Swindale. Nyquist interpolation improves neuron yield in multiunit recordings. 
J. Neurosci. Meth., 155:81-91, 2006. 

[3] G. Buzsaki, C. A. Anastassiou, and C. Koch. The origin of extracellular fields and currents—EEG, 
ECoG, LFP and spikes. Nature Reviews Neurosci., 13:407-420, 2012. 

[4] L. A. Camuhas-Mesa and R. Q. Quiroga. A detailed and fast model of extracellular recordings. Neural 
Comput, 25:1191-1212, 2013. 

[5] D. E. Carlson, J. T. Vogelstein, Q. Wu, W. Lian, M. Zhou, C. R. Stoetzner, D. Kipke, D. Weber, D. B. 
Dunson, and L. Garin. Multichannel electrophysiology spike sorting via joint dictionary learning & 
mixture modeling. IEEE Trans. Biomed. Eng., 2013. 

[ 6 ] G. T. Einevoll, F. Franke, E. Hagen, C. Pouzat, and K. D. Harris. Towards reliable spike-train recordings 
from thousands of neurons with multielectrodes. Current Opinion in Neurobiology, 22(1):11-17, 2012. 

[7] C. Ekanadham, D. Tranchina, and E. P. Simoncelli. A unified framework and method for automatic 
neural spike identification. J. Neurosci. Meth., 222:47-55, 2013. 

[ 8 ] M. S. Fee, P. P. Mitra, and D. Kleinfeld. Automatic sorting of multiple unit neuronal signals in the 
presence of anisotropic and non-Gaussian variability. J. Neurosci. Meth., 69:175-188, 1996. 

[9] F. Franke, M. Natora, C. Boucsein, M. H. J. Munk, and K. Obermayer. An online spike detection 
and spike classification algorithm capable of instantaneous resolution of overlapping spikes. J Comput. 
Neurosci., 29:127-148, 2010. 

[10] S. Gibson, J. W. Judy, and D. Markovic. Spike sorting: The first step in decoding the brain. IEEE 
Signal Proc. Magazine, pages 124-143, January 2012. 


20 




[11] K. D. Harris, H. D. A, J. Csicsvari, H. Hirase, and G. Buzsaki. Accuracy of tetrode spike separation as 
determined by simultaneous intracellular and extracellular measurements. J. Neurophysiology, 84:401- 
414, 2000. 

[12] C. Hennig. Cluster-wise assessment of cluster stability. Comput. Stat. Data An., 52(1):258-271, 2007. 

[13] D. N. Hill, S. B. Mehta, and D. Kleinfeld. Quality metrics to accompany spike sorting of extracellular 
signals. J. Neurosci., 31(24):8699-8705, 2011. 

[14] A. M. Krieger and P. E. Green. A cautionary note on using internal cross validation to select the number 
of clusters. Psychometrika, 64(3):341~353, 1999. 

[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 
2:83-97, 1955. 

[16] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. 
Neural computation, 16(6):1299-1323, 2004. 

[17] M. S. Lewicki. A review of methods for spike sorting: the detection and classification of neural action 
potentials. Network: Comput. Neural Syst., 9:R53-R78, 1998. 

[18] P. H. Li, J. L. Gauthier, M. Schiff, A. Sher, D. Ahn, G. D. Field, M. Greschner, E. M. Callaway, A. M. 
Litke, and E. J. Chichilnisky. Anatomical Identification of Extracellularly Recorded Cells in Large-Scale 
Multielectrode Recordings. J. Neurosci., 35(ll):4663-4675, 2015. 

[19] A. M. Litke, N. Bezayiff, E. J. Chichilnisky, W. Cunningham, W. Dabrowski, A. A. Grillo, M. Grivich, 
P. Grybos, P. Hottowy, S. Kachiguine, R. S. Kalmar, K. Mathieson, D. P. D, M. Rahman, and A. Sher. 
What does the eye tell the brain? development of a system for the large scale recording of retinal output 
activity. IEEE Transactions on Nuclear Science, 51(4):1434-1440, 2004. 

[20] D. J. C. MacKay. Introduction to Monte Carlo methods. In M. 1. Jordan, editor, Learning in Graphical 
Models, NATO Science Series, pages 175-204. Kluwer Academic Press, 1998. 

[21] O. Marre, D. Amodei, N. Deshmukh, K. Sadeghi, F. Soo, T. E. Holy, and M. J. B. 11. Mapping a 
complete neural population in the retina. J. Neurosci., 32(43):14859-14873, 2012. 

[22] C. Moore-Kochlacs, J. Scholvin, J. P. Kinney, J. G. Bernstein, Y. Yoon, S. K. Arfin, N. Kopell, and 
E. S. Boyden. Principles of high-fidelity, high-density 3-d neural recording. BMC Neuroscience, 15(Suppl 
1):P122, 2014. 

[23] S. A. Neymotin, W. W. Lytton, A. V. Olypher, and A. A. Fenton. Measuring the Quality of Neuronal 
Identification in Ensemble Recordings. J. Neurosci., 31(45):16398-16409, 2011. 

[24] F. W. J. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark, editors. NIST Handbook of Mathematical 
Functions. Cambridge University Press, 2010. http://dlmf.nist.gov. 

[25] J. W. Pillow, J. Shlens, E. J. Chichilnisky, and E. P. Simoncelli. A model-based spike sorting algorithm 
for removing correlation artifacts in multi-neuron recordings. PLoS ONE, 8(5):e62123, 2013. 

[26] C. Pouzat, O. Mazor, and G. Laurent. Using noise signature to optimize spike-sorting and to assess 
neuronal classification quality. J. Neurosci. Meth., 122:43-57, 2002. 


21 



[27] J. S. Prentice, J. Homann, K. D. Simmons, G. Tkacik, V. Balasubramanian, and P. C. Nelson. Fast, 
scalable, Bayesian spike identification and multi-electrode arrays. PLoS ONE, 6(7):el9884, 2011. 

[28] R. Q. Quiroga. Spike sorting. Scholarpedia, 2(12):3583, 2007. 

[29] R. Q. Quiroga. Spike sorting. Current Biology, 22(2):R45-R46, 2012. 

[30] W. M. Rand. Objecive criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66:846-850, 
1971. 

[31] C. Rossant, S. Kadir, D. F. M. Goodman, J. Schulman, M. Belluscio, G. Buzsaki, and K. D. 
Harris. Spike sorting for large, dense electrode arrays, 2015. Biorxiv preprint available at 
http ://dx.doi.org/10.1101/015198. 

[32] N. Schmitzer-Torbert, J. Jackson, D. Henze, K. Harris, and a. D. Redish. Quantitative measures of 
cluster quality for use in extracellular recordings. Neuroscience, 131(1):1-11, 2005. 

[33] O. Shamir and N. Tishby. On the reliability of clustering stability in the large sample regime. Advances 
in neural information processing systems (NIPS), pages 1465-1472, 2009. 

[34] S. Takahashi, Y. Sakurai, M. Tsukada, and Y. Anzai. Classification of neuronal activities from tetrode 
recordings using independent component analysis. Neurocomputing, 49:289-298, 2002. 

[35] A. Tankus, Y. Yeshurun, and I. Fried. An automatic measure for classifying clusters of suspected spikes 
into single cells versus multiunits. J Neural Eng., 6(5):056001, 2009. 

[36] U. von Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 
2(3):235-274, 2009. 

[37] B. O. Watson and G. Buzsaki. Personal communication, 2015. 

[38] J. Wild, Z. Prekopcsak, T. Sieger, D. Novak, and R. Jech. Performance comparison of extracellular 
spike sorting algorithms for single-channel recordings. J. Neurosci. Meth., 203(2):369-376, 2012. 

[39] F. Wood and M. J. Black. A non-parametric Bayesian alternative to spike sorting. J. Neurosci. Meth., 
173:1-12, 2008. 

[40] B. Yu. Stability. Bernoulli, 19:1484-1500, 2013. 

[41] M. J. Zaki and W. Meira Jr. Data mining and analysis: fundamental concepts and algorithms. Cam¬ 
bridge University Press, New York, NY, 2014. 


22 



